Automatic subject-specific spatiotemporal feature selection for subject-independent affective BCI

The dimensionality of the spatially distributed channels and the temporal resolution of electroencephalogram (EEG) based brain-computer interfaces (BCI) undermine emotion recognition models. Thus, before such data are modeled in the final stage of the learning pipeline, adequate preprocessing, transformation, and extraction of temporal (i.e., time-series signals) and spatial (i.e., electrode channels) features are essential to recognizing underlying human emotions. Conventionally, inter-subject variations are dealt with by avoiding the sources of variation (e.g., outliers) or by turning the problem into a subject-dependent one. We address this issue by preserving and learning from individual particularities in response to affective stimuli. This paper proposes and investigates a subject-independent emotion recognition framework that mitigates the subject-to-subject variability in such systems. Using an unsupervised feature selection algorithm, we reduce the feature space extracted from time-series signals. For the spatial features, we propose a subject-specific unsupervised learning algorithm that learns from inter-channel co-activation online. We tested this framework on real EEG benchmarks, namely DEAP, MAHNOB-HCI, and DREAMER. We train and test the selection outcomes using nested cross-validation and a support vector machine (SVM), and we compare our results with state-of-the-art subject-independent algorithms. Our results show that the framework classifies human affect (i.e., valence and arousal) more accurately than other studies, with an improvement of 16%–27%. This work not only outperforms other subject-independent studies reported in the literature but also proposes an online analysis solution to affect recognition.


Introduction
A lack of understanding of neurophysiological signals has resulted in numerous unanswered questions about human beings, their health, and their cognitive and social development, as well as human-to-human and human-to-machine interaction. Traditionally, communication can be categorized into active (direct) and passive communication. Active communication is contextual and can be interpreted, while passive communication involves uncertainty, and interpreting it is difficult. All means of communication (e.g., written, verbal, body language) result from neural activation in the brain. Human affect, one of the most complicated areas of investigation in cognitive and behavioral psychology, is hard to predict, communicate, and interpret, whether it occurs intrinsically (e.g., only neural activation) or extrinsically (e.g., via any ordinary method of communication). The urge to understand the human brain and emotions has advanced technologies and emerging interdisciplinary fields (e.g., affective computing [1]) that recognize, interpret, and process human affects.
Emotion is a multifaceted phenomenon that can be hard to quantify for many reasons, namely, 1) individuality and event-dependence (e.g., two different people might react differently to a similar situation); 2) the existence of a broad, overlapping spectrum of emotional states; and 3) the fleeting property of emotions (i.e., emotion is momentary, as opposed to mood, which lasts longer and can be a personal trait) [2]. There are two primary methods for identifying emotions: the Darwinian method (i.e., universal basic emotions) and the cognitive method (i.e., dimensional measures of valence and arousal) [3]. The cognitive method is a quantitative approach in which two continuous scales (i.e., arousal and valence) represent each emotion and from which other emotions can be derived. These affective measures might then correlate with neural activation in brain regions that can be detected using invasive and non-invasive techniques.
EEG is a non-invasive technique that measures the electrical potential produced by the firing of millions of neurons and projected on the scalp as a spatially distributed electrical potential. Compared to other non-invasive techniques, EEG has gained the interest of industry and the research community [4] thanks to 1) effective temporal resolution, measured in milliseconds; 2) reasonable cost compared to fMRI; and 3) portability and suitability for different experimental setups. However, EEG suffers from low spatial resolution and from artifact interference caused by muscle movement, power lines, and eye-blinking.
Despite the disagreement in the emotion literature regarding which regional sources are significant in identifying human emotions, it is believed that identification of positive and negative emotions occurs in the frontal, temporal, and parietal lobes. Zhang et al. applied a channel selection procedure based on the ReliefF feature selection algorithm [5] and concluded that the electrode channels Fp1 and T7, among other channels in the lobes mentioned above, were important. Channels Fp1 and Fp2 were frequently selected in a subject-specific sequential forward search proposed by [6]. In [7][8][9], the authors thoroughly reviewed and investigated different channel selection and feature extraction methods trained on several classifiers. Channel selection methods for detecting event-related potential in single-trial studies are discussed in [10]. Spatiotemporal features, however, exhibit apparent inter-subject variation, which has led researchers to select a subset of subjects with less variation or to prefer subject-dependent modeling instead.
In this article, we address subject-specific particularities in order to recognize emotions across subjects. Subject-to-subject variation has been identified as one reason for low performance in subject-independent emotion recognition studies. However, subjective particularities have not been actively considered when constructing affective recognition systems. The contributions of this work are 1) studying subject-specific brain connectivity, over several epochs, for individuals participating in the experiments described in the next section; 2) feature and channel (i.e., spatiotemporal feature) selection using an online unsupervised learning approach; and 3) encapsulating these algorithms into the proposed affective recognition framework, which recognizes emotions based on features that are relevant to the participating individuals and improves the overall performance of subject-independent affective recognition.

Datasets
DEAP is short for the Dataset of Emotional Analysis using Physiological Signals [11]. It involves 32 subjects, each of whom watched 40 music video clips that vary in their emotional content. The dataset consists of a multi-channel multidimensional array of size 32 × 40 × 40 × 8064 (representing subject × video × channel × data/signal). We used only the 32 EEG channels and excluded the other physiological measures, except for the electrooculography (EOG) channels, which we used for ocular artifact removal. At the end of each stimulus session (60 s), subjects rated their affective response to that stimulus. The sampling rate of this experiment is 512 Hz, and all data are available.
MAHNOB-HCI [12] consists of 27 subjects who watched 20 movie excerpts varying in duration between 34.9 and 117 s, recorded at a sampling rate of 512 Hz. Subjects reported their valence and arousal feedback, among other assessment reports. This experiment is similar in technical setup to DEAP, except that it does not include EOG sensors.
The DREAMER dataset [13] also tests human affect, through 18 music videos that vary in duration (65-393 s). Twenty-three participants watched and evaluated those stimuli. The experimenter used an Emotiv EPOC headset with 14 electrode channels, recording at a rate of 128 Hz.

Subject dependency
Emotion recognition studies can mainly be categorized into two schemes: subject-independent and subject-dependent. A subject-independent scheme is stimulus-dependent, where different neurophysiological features from different subjects contribute to the prediction outcomes based on their affective feedback. In this scheme, each stimulus is a unit, and each subject is a data point. Each participant provides feedback on their emotional states in response to stimuli in the environment. Conversely, subject-dependent learning, in which a model is trained and tested on data from the same individual, is another scheme of affective learning that has been implemented extensively in recent affective computing studies. Compared to the subject-dependent scheme, subject-independent emotion recognition is closer in practice to real-life applications, and it is generalizable across a population. In the subject-independent scheme, we aim to verify the validity and effect of the event from different perspectives, not only from the perspective of a single respondent. One drawback of the subject-independent scheme is that it ignores the particularities of individuals and generalizes the learning process despite the subjective differences between participants. In this work, we propose a mechanism involving personal neurological specificities to enhance the outcomes of the subject-independent scheme.

Methodology
We propose an extended BCI pipeline for automatic subject-specific unsupervised feature and channel selection in a subject-independent scheme of affect modeling and recognition. Although our main contribution is the channel selection, we also enhanced other parts of the framework, such as the unsupervised feature selection and ranking through epochs. An overview of the proposed framework is shown in Fig 1. It consists of the following phases: 1) data (i.e., EEG and EOG signals) acquisition, 2) epoching the time-series data, 3) preprocessing each epoch, 4) feature extraction, 5) unsupervised channel and feature selection, 6) supervised learning, and 7) evaluation and analysis.

Time slicing (Epoching)
Event-related sensors' signals might be underestimated when a global analysis of the given signals is applied. Therefore, we believe dealing with smaller windows of time would lead to a higher temporal resolution; hence, feature extraction will be able to extract more relevant event-aware features [14].
We define an epoch size, referred to as w, as the synchronization point that all subjects hypothetically need to reach simultaneously for a given stimulus. Each epoch of size w forms an individual unit of time to which feature extraction methods are applied. We set three window sizes: 2, 5, and 10 seconds. Investigating the window sizes is outside the scope of this work.
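The slicing step can be illustrated with a minimal sketch. It assumes non-overlapping windows and a trial stored as a (channels, samples) NumPy array; the function name `epoch_signal`, the choice to drop trailing samples, and the synthetic trial are illustrative assumptions, not details taken from the framework.

```python
import numpy as np

def epoch_signal(signal, fs, window_s):
    """Split a (channels, samples) array into non-overlapping epochs.

    Returns an array of shape (n_epochs, channels, samples_per_epoch).
    Trailing samples that do not fill a whole window are dropped (assumption).
    """
    signal = np.asarray(signal)
    samples_per_epoch = int(round(window_s * fs))
    n_epochs = signal.shape[1] // samples_per_epoch
    trimmed = signal[:, : n_epochs * samples_per_epoch]
    # (channels, n_epochs, samples) -> (n_epochs, channels, samples)
    blocks = trimmed.reshape(signal.shape[0], n_epochs, samples_per_epoch)
    return blocks.transpose(1, 0, 2)

# Synthetic 60 s trial with 32 channels at 128 Hz (placeholder data).
fs = 128
trial = np.random.default_rng(0).standard_normal((32, 60 * fs))
for w in (2, 5, 10):  # the three window sizes used in this work
    print(w, epoch_signal(trial, fs, w).shape)
```

Each returned epoch can then be passed to preprocessing and feature extraction independently of the others.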

Signal preprocessing
EEG signals are susceptible to external sources of noise (i.e., artifacts). The main external sources of EEG artifacts are power lines, eye-blinking, and muscle movement. Avoiding these undesired artifacts is almost impossible; however, post-acquisition processes might be beneficial to reduce a fair amount of these inevitable artifacts. In [9], a comprehensive survey of EEG-based emotion studies shows that preprocessing techniques were overwhelmingly considered, which suggests the complexity, yet vitality, of this phase. In this paper, we consider the following EEG preprocessing procedure as suggested by [15,16]:
• Notch filter: We removed power line noise at 50 Hz (i.e., the nominal frequency in the European power standard) using a notch filter.
• Band-pass filter: We applied a band-pass filter (0.5 Hz-40 Hz) corresponding to the Nyquist frequency.
• Ocular noise: To detect eye-blink moments, we referenced the vertical EOG sensors (i.e., bipolar reference).
• Down-sampling: We down-sampled to 128 Hz to mitigate the computational and storage complexities.
This preprocessing procedure was slightly tuned for each dataset due to dataset-based variations such as equipment and setups. For example, when a dataset does not include ocular sources to detect eye-blinks, prefrontal electrodes can alternatively be referenced. Independent component analysis (ICA) used to be a bottleneck in automating BCI applications because of the need for manual rejection of bad components. However, several algorithms that automate ICA component rejection have been developed, such as [17,18].
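The filtering and down-sampling steps listed above can be sketched with SciPy as follows. This is only an illustration under stated assumptions: the notch quality factor (Q = 30), the fourth-order Butterworth design, zero-phase filtering, and polyphase resampling are choices made for the sketch and are not specified in this paper. The ocular-artifact step (EOG referencing and ICA) is omitted.

```python
import numpy as np
from scipy import signal

def preprocess(eeg, fs=512, line_hz=50.0, band=(0.5, 40.0), target_fs=128):
    """Filter and down-sample a (channels, samples) EEG array."""
    # Notch filter at the power-line frequency (Q = 30 is an assumed value).
    b, a = signal.iirnotch(line_hz, Q=30.0, fs=fs)
    x = signal.filtfilt(b, a, eeg, axis=-1)
    # Zero-phase Butterworth band-pass (order 4 is an assumed value).
    sos = signal.butter(4, band, btype="bandpass", fs=fs, output="sos")
    x = signal.sosfiltfilt(sos, x, axis=-1)
    # Polyphase down-sampling, which applies its own anti-aliasing filter.
    return signal.resample_poly(x, up=target_fs, down=fs, axis=-1)

# Synthetic check: a 10 Hz component plus 50 Hz line noise, 4 s at 512 Hz.
t = np.arange(4 * 512) / 512
raw = (np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 50 * t))[None, :]
clean = preprocess(raw)
print(clean.shape)  # one channel, 4 s at 128 Hz
```

With a band-pass upper edge of 40 Hz, the retained content stays below the 64 Hz Nyquist limit of the 128 Hz output.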

Feature extraction
After preprocessing EEG signals, we can extract features from the time-series data. Feature extraction methods can operate in the time domain, the frequency domain, or the time-frequency domain. A large combination of features can result from the multidimensional EEG signals and the many feature extraction algorithms presented in the EEG literature.
We use a variety of the most common feature extraction methods that have been extensively applied in BCI and emotion recognition applications [19,20]. Below, we list the most frequently used methods in the time, frequency, and time-frequency domains that we use in this work. See [19,21] for more details. We extracted 54 features from each electrode channel (i.e., 1728 features for a 32-channel system) per subject. The tags we use in the figures below are given in parentheses (.) as a reference for the reader.
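As an illustration of per-channel feature extraction, the sketch below computes a handful of the feature types discussed in this paper (mean, skewness, Hjorth parameters, and average band power). It is not the full 54-feature set. The band edges in `BANDS`, the FFT-based periodogram, and all function names are assumptions made for the example.

```python
import numpy as np

# Assumed band edges in Hz; the paper does not state the exact limits.
BANDS = {"theta": (4.0, 8.0), "alpha": (8.0, 13.0), "beta": (13.0, 30.0)}

def hjorth(x):
    """Hjorth activity, mobility, and complexity of a 1-D signal."""
    dx = np.diff(x)
    ddx = np.diff(dx)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def band_powers(x, fs):
    """Average periodogram power per band (up to a constant scale factor)."""
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    psd = np.abs(np.fft.rfft(x - x.mean())) ** 2 / (fs * x.size)
    return {name: psd[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in BANDS.items()}

def channel_features(x, fs):
    """Return a small dictionary of time- and frequency-domain features."""
    x = np.asarray(x, dtype=float)
    mu, sd = x.mean(), x.std()
    activity, mobility, complexity = hjorth(x)
    feats = {
        "mean": mu,
        "skewness": np.mean((x - mu) ** 3) / sd ** 3,
        "hjorth_activity": activity,
        "hjorth_mobility": mobility,
        "hjorth_complexity": complexity,
    }
    feats.update({f"{k}_power": v for k, v in band_powers(x, fs).items()})
    return feats

# One 2 s epoch of a synthetic 10 Hz oscillation sampled at 128 Hz.
fs = 128
epoch = np.sin(2 * np.pi * 10 * np.arange(2 * fs) / fs)
features = channel_features(epoch, fs)
print(sorted(features))
```

Applying such a function to every channel of an epoch and concatenating the results gives the per-epoch feature vector; with 54 features and 32 channels this yields the 54 × 32 = 1728 features mentioned above.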

Unsupervised feature selection
Feature selection is a dimensionality reduction process that aims to represent data in a low-dimensional space by selecting a subset of its original feature space. It approaches this goal by eliminating irrelevant and redundant features. Based on where a feature selection method is designed to occur in the learning pipeline, feature selection can generally be divided into four categories: filter, wrapper, embedded, and hybrid methods [9,22]. The latter two are derived from filter methods, wrapper methods, or a combination of both. Filter methods apply the selection criteria regardless of the outcomes of the classification algorithm, thus reducing the effect of over-fitting. In contrast, wrapper methods are highly dependent on the learning process, where the outcome of the learner decides whether or not certain features are to be eliminated. Unlike filter methods, wrapper methods involve extensive computation since most of the learning pipeline is run recursively to search for an optimized selection of features.
This framework is designed to select features in an online, unsupervised manner. For this purpose, several unsupervised algorithms, such as the Laplacian score and the dispersion ratio, have been reported in many feature selection studies. Here, we use Laplacian score feature selection to rank the time and frequency features of subject-specific episodic signals generated from different sources (i.e., channels). The Laplacian score algorithm works as follows:
• Construct a k-nearest neighbor graph G.
• Define a similarity matrix S populated as in (1).
• Compute the graph Laplacian for each feature r using L = D − S, where D is the diagonal degree matrix of S.
• Finally, calculate a Laplacian score for each feature f_r; the goal is to minimize the objective function. Features with lower scores are ranked higher.
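The steps above can be sketched in NumPy as follows. The sketch assumes the commonly used formulation of the Laplacian score (a symmetrized k-nearest-neighbor graph, heat-kernel similarities exp(−d²/t), and a degree-weighted centering of each feature); the parameters `k` and `t`, the function name, and the synthetic data are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np

def laplacian_score(X, k=5, t=1.0):
    """Laplacian score per feature for X of shape (n_samples, n_features).

    Lower scores indicate features that better preserve local structure.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # k-nearest-neighbor graph, excluding self-matches, then symmetrized.
    d2_no_self = d2.copy()
    np.fill_diagonal(d2_no_self, np.inf)
    nearest = np.argsort(d2_no_self, axis=1)[:, :k]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = True
    adj |= adj.T
    # Heat-kernel similarity on graph edges, zero elsewhere.
    S = np.where(adj, np.exp(-d2 / t), 0.0)
    D = S.sum(axis=1)              # diagonal of the degree matrix
    L = np.diag(D) - S             # graph Laplacian
    # Remove the degree-weighted mean of every feature.
    Xc = X - (D @ X) / D.sum()
    numerator = np.einsum("ir,ij,jr->r", Xc, L, Xc)
    denominator = np.einsum("ir,i,ir->r", Xc, D, Xc)
    return numerator / denominator

# Synthetic data: feature 0 carries cluster structure, the rest are noise.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 30)
X_demo = rng.standard_normal((60, 4))
X_demo[:, 0] = 5.0 * groups + 0.1 * rng.standard_normal(60)
scores = laplacian_score(X_demo, k=5)
ranking = np.argsort(scores)   # ascending: best-ranked feature first
print(ranking)
```

Constant features would produce a zero denominator and should be removed before scoring.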

Unsupervised subject-specific channel selection algorithm
Channel selection can contextually be seen as a sub-domain of feature selection. EEG signals are multi-dimensional: a number of channels detect and record simultaneous signals from different regions of the brain. Temporal signals are then transformed into more informative features.
We present a novel algorithm that aims to find a representative subset of the channel space by learning from inter-channel interactivity while subjects are still in session. Because this work is based on unsupervised similarity-based learning, channel selection, which is known as a time-consuming task, can now be performed online. The ultimate goal of this phase is to minimize unnecessary redundancy in the channel space and reduce computational and storage overheads.
3.5.1 Inter-channel connectivity learning. Here, we propose an inter-channel connectivity-based learning algorithm that learns subject-specific connectivity patterns over epochs.
Notation: In this section, let $X^{T}_{C \times F}$ be subject-trial-independent data encapsulated in tensor form as $X^{t}_{C \times F}$, where $t$ refers to epochs (i.e., time windows) of a given time series, and $c \in C$ is an electrode channel in the channel set, each of which has a feature vector $f \in F$. A similarity matrix $S \in \mathbb{R}^{n \times n}$ holds the pairwise similarity among instances of a given dimension. For instance, a pairwise similarity among channels is calculated along the electrode channel dimension $\forall c \in C$. Pairwise relations are best described as graphs since similarity among these instances can be encoded in an affinity graph. We define a graph $G = (V, E)$, where $V$ is a set of vertices or nodes (i.e., electrode channels) and $E$ is the edge set of $G$. $G$ is an undirected symmetric graph in which an edge $e_{uv}$ is the similarity measure (i.e., correlation) between two nodes (i.e., electrode channels) $\langle u, v \rangle$.
Algorithm: Given $X^{T}_{C \times F}$, we aim to find a subset of electrode channels $\hat{C} \subset C$ without adversely compromising representability compared to using the whole channel space $C$. In the following, we provide a narrative explanation of the algorithm. An illustrative diagram is shown in Fig 2. Prior to investigating inter-channel functional connectivity, we first construct a neighboring matrix that encodes the spatial distribution of the EEG electrode channels. The neighboring matrix is denoted as $\delta \in \mathbb{R}^{c \times c}$, and each entry is populated by a distance measure between every two channels in the channel space, as given by (3).
In (3), δ is a topological distance function that calculates the shortest path of how far node p is from node q. κ is the unweighted edge distance between any two nodes of the function δ such that it does not exceed a preset maximum value κ max .
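A minimal sketch of such a neighboring matrix is given below, using breadth-first search to count hops between electrodes. The five-electrode chain used as the montage is hypothetical, and clipping distances beyond κ_max to κ_max is one possible reading of the constraint described above, not necessarily the rule used in (3).

```python
from collections import deque
import numpy as np

def hop_distance_matrix(neighbors, kappa_max):
    """Shortest-path hop counts between channels, capped at kappa_max.

    neighbors: dict mapping each channel name to its adjacent channels.
    Returns (names, delta), where delta[i, j] is the capped hop distance.
    """
    names = sorted(neighbors)
    index = {name: i for i, name in enumerate(names)}
    # Pairs that are never reached within the cap keep the value kappa_max.
    delta = np.full((len(names), len(names)), kappa_max, dtype=int)
    for src in names:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            if dist[node] == kappa_max:
                continue  # do not expand beyond the cap
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for name, d in dist.items():
            delta[index[src], index[name]] = d
    return names, delta

# Hypothetical adjacency: a front-to-back chain of five electrodes.
montage = {
    "Fp1": ["F3"],
    "F3": ["Fp1", "C3"],
    "C3": ["F3", "P3"],
    "P3": ["C3", "O1"],
    "O1": ["P3"],
}
names, delta = hop_distance_matrix(montage, kappa_max=2)
print(names)
print(delta)
```

In practice the adjacency would come from the electrode layout of the recording cap rather than from a hand-written dictionary.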
Independent of the spatial distribution, we assess how similar (i.e., correlated) electrode channels are to each other with respect to functional neural co-activation. For that, we measure inter-channel similarities given their feature spaces. Using an efficient similarity measure is vital. Euclidean distance, correlation, and mutual information are reportedly used on such data [23]. Here, we use Pearson's correlation (4).
$$S_{ij}^{t} = \frac{\mathrm{Cov}(X_{i}^{t}, X_{j}^{t})}{\sqrt{\mathrm{Var}(X_{i}^{t})\,\mathrm{Var}(X_{j}^{t})}} \qquad (4)$$
where $i$ and $j$ are any two electrode channels $c \in \mathbb{R}^{f}$, $f$ is the feature space, and $i \neq j$. Now we cluster electrode channels that synchronously co-activate based on how similar inter-channel attributes are in a given window of time. Electrode channels assigned to different clusters are less similar in terms of their time and frequency attributes. The time-variant channel synchrony increases the complexity of detecting co-activation between different regions of the brain. Therefore, a clustering method that discovers co-activations in all regions without extra presumptions, such as a predefined number of clusters or a high dependency on random initialization, is favorable. Hence, we apply affinity propagation (AP), in which data samples iteratively exchange messages until, upon convergence, each data sample is identified either as a cluster exemplar or as a member associated with another data sample identified as an exemplar, based on a similarity matrix $S_{ij}$. In order to identify exemplars and cluster other data points accordingly, AP defines two matrices [24]: responsibility ρ and availability α, as shown in Eqs (5) and (6), respectively.
where $i$ and $k$ are any data sample and potential exemplar, respectively. Eq (6) updates the on- and off-diagonal entries of α. Ultimately, each data sample will be assigned to the exemplar that seems most similar to it and that is also selected by other samples. Exemplars and their associated data samples form the final clusters. The final co-activation functional clusters, which result from the set of exemplars $K$ in $\{\alpha_{kk} + \rho_{kk} > 0\}$ and the assignment process $\forall i \neq k \in n$, are depicted as a labeling matrix $L$, as shown in (7).
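The clustering step can be sketched in NumPy as follows, taking Pearson correlation between channel feature vectors as the similarity and applying the standard damped affinity propagation updates for responsibility and availability. The damping factor, the fixed number of iterations, the median-similarity preference on the diagonal, and the synthetic channels are assumptions made for this sketch.

```python
import numpy as np

def affinity_propagation(S, damping=0.8, max_iter=300):
    """Cluster samples from a similarity matrix S; returns (exemplars, labels)."""
    S = np.array(S, dtype=float)   # copy so the caller's matrix is untouched
    n = S.shape[0]
    # Shared preference on the diagonal (assumed: median off-diagonal similarity).
    np.fill_diagonal(S, np.median(S[~np.eye(n, dtype=bool)]))
    R = np.zeros((n, n))           # responsibility
    A = np.zeros((n, n))           # availability
    rows = np.arange(n)
    for _ in range(max_iter):
        # Responsibility: s(i,k) minus the best competing a(i,k') + s(i,k').
        AS = A + S
        best = AS.argmax(axis=1)
        best_val = AS[rows, best]
        AS[rows, best] = -np.inf
        second_val = AS.max(axis=1)
        R_new = S - best_val[:, None]
        R_new[rows, best] = S[rows, best] - second_val
        R = damping * R + (1 - damping) * R_new
        # Availability: off-diagonal entries are clipped at 0, diagonal is not.
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, np.diag(R))
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0.0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new
    # Exemplars satisfy a(k,k) + r(k,k) > 0.
    exemplars = np.flatnonzero(np.diag(A) + np.diag(R) > 0)
    if exemplars.size == 0:
        return exemplars, np.full(n, -1)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels

# Six synthetic channels (rows) with 200 features each, in two groups.
rng = np.random.default_rng(0)
base = rng.standard_normal((2, 200))
channels = np.vstack([
    base[0], base[0] + 0.7 * rng.standard_normal(200),
    base[0] + 0.7 * rng.standard_normal(200),
    base[1], base[1] + 0.7 * rng.standard_normal(200),
    base[1] + 0.7 * rng.standard_normal(200),
])
similarity = np.corrcoef(channels)   # Pearson correlation between channels
exemplars, labels = affinity_propagation(similarity)
print(exemplars, labels)
```

The number of clusters is not fixed in advance; it follows from the similarities and the preference value, which is the property that motivates the choice of this method here.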
With the knowledge of the functional clusters, presented in Eq (7), as well as spatial distribution of channels, we construct a graph G that represents concurrent activation among nodes (i.e. electrode channels). Spatially close electrode channels often read similar signals due to projection of underlying neurons, so we assign lesser weights on their corresponding edges according to their spatial closeness, Eq (3).
This algorithm selects different sets of channels in subsequent epochs. Hence, to decide which subject-specific channels should eventually be involved, we evaluate the outcomes of subsequent epochs and limit the selection to epochs that exhibit consistency in the way channels co-activate. This reduces the random electric potential that is caused by remaining artifacts and/or non-event-related neural activities. Finally, we rank electrode channels based on their consistency through epochs as well as their functional over spatial correlation.

Best-k subject-independent channel and feature selection via mutual information estimation
In an unsupervised fashion, we designed the above pipeline to extract features and to select channels and their associated features based on individuals' reception of the stimuli presented in the three experiments. The proposed channel selection and feature selection algorithms rank subjective channels and features based on neurological connectivity across the epochs of a stimulus. Inferring the importance of channels and features in real time is practical for achieving online emotion recognition. Since labeling is not available until the end of an experiment, we rely on unsupervised learning to infer the underlying feature and channel importance. Once subjective labeling is available, deciding the top k channels and features becomes more feasible. A subset of the inter-subject channel space and its extracted features is then selected to train the affect recognition model.
Mutual information estimation (MIE) between individual features and the target variable quantifies how relevant two random variables (x, y) are to each other. Given multivariate data $X \in \mathbb{R}^{x}$, where x represents a feature vector, and a target variable y, we estimate MI(x, y) for each feature vector as:
$$MI(x, y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$
where p(.) is the probability distribution function.
Features that are strictly irrelevant to the target variable receive a weight of zero and are therefore eliminated. The higher the mutual information estimate is, the more relevant the feature is to the target.
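To make the relevance ranking concrete, the sketch below estimates mutual information between each continuous feature and a discrete label with a simple histogram (plug-in) estimator and keeps the top k features. The estimator, the number of bins, and the function names are assumptions for illustration; the estimator used in this work may differ. Note that a plug-in estimate is slightly positive even for irrelevant features, so a small threshold may be needed in practice.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in MI estimate (in nats) between a continuous x and a discrete y."""
    x = np.asarray(x, dtype=float)
    # Discretize x into equal-width bins using the interior bin edges.
    edges = np.histogram_bin_edges(x, bins=bins)[1:-1]
    x_idx = np.digitize(x, edges)
    classes, y_idx = np.unique(np.ravel(y), return_inverse=True)
    # Joint distribution p(x, y) from co-occurrence counts.
    joint = np.zeros((bins, classes.size))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def top_k_features(X, y, k):
    """Indices of the k features with the highest estimated MI."""
    mi = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(mi)[::-1][:k], mi

# Synthetic example: feature 0 depends on the label, features 1-2 do not.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.standard_normal((400, 3))
X_demo[:, 0] = y_demo + 0.3 * rng.standard_normal(400)
selected, mi_values = top_k_features(X_demo, y_demo, k=1)
print(selected, mi_values)
```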

Learning and cross-validation
To classify emotions, we use a support vector machine (SVM) to make comparisons with other works fair and to show the merit of the proposed framework. Using grid search, the regularization parameter C is searched over (0.025, 0.5, 1, 10, 100) for both linear and radial basis function (RBF) kernels (RBF kernel scale = [0.001, 2, 10, 50, 100]). Since it is not the focus of this work, we do not tune these kernels intensively or emphasize their parametric outcomes.
We used nested cross-validation to eliminate data leakage from the MI-based feature selection to the testing set. The inner cross-validation performs mutual information estimation to select the inter-subject temporal and spatial features that were ranked in a previous step based on individual subjects. The outer cross-validation uses an all-but-one training set, with a novel subject assigned to the testing set. In previous works, MIE or minimum redundancy maximum relevance (mRMR) is usually used to search exhaustively through all combinations of channels and features. Here, since we reduce the spatial and temporal feature spaces earlier in this framework, MIE is left with a smaller set of features whose relevance to the target emotion it has to measure.
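The leakage-free structure described above can be sketched with scikit-learn, which is an assumption of this example since the toolkit is not named here. The MI-based selection sits inside the pipeline so that it is refitted on training subjects only, the outer loop leaves one subject out, and the inner grid search uses the C values listed above with a linear kernel. The function name, the number of inner folds, `k_features`, and the synthetic data are illustrative.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, GroupKFold, LeaveOneGroupOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def nested_loso_f1(X, y, subjects, k_features=5):
    """Leave-one-subject-out F1 scores with selection and tuning kept inside."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        # MI-based selection is fitted on the training subjects only.
        ("select", SelectKBest(partial(mutual_info_classif, random_state=0),
                               k=k_features)),
        ("svm", SVC(kernel="linear")),
    ])
    grid = {"svm__C": [0.025, 0.5, 1, 10, 100]}
    scores = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=subjects):
        # Inner loop: subject-wise folds for choosing C (3 folds assumed).
        search = GridSearchCV(pipe, grid, cv=GroupKFold(n_splits=3), scoring="f1")
        search.fit(X[train], y[train], groups=subjects[train])
        # Outer loop: evaluate on the held-out (novel) subject.
        scores.append(f1_score(y[test], search.predict(X[test])))
    return np.array(scores)

# Synthetic data: 6 subjects x 20 trials, 10 features, 2 of them informative.
rng = np.random.default_rng(0)
subjects = np.repeat(np.arange(6), 20)
y_demo = np.tile([0, 1], 60)
X_demo = rng.standard_normal((120, 10))
X_demo[:, :2] += 2.0 * y_demo[:, None]
fold_scores = nested_loso_f1(X_demo, y_demo, subjects, k_features=3)
print(fold_scores.round(2), fold_scores.mean().round(2))
```

Because the scaler and the selector are pipeline steps, the held-out subject never influences which features are kept or how they are scaled.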

Results and discussion
Here, we present the performance of our proposed framework applied to three real data benchmarks, namely DEAP, MAHNOB-HCI, and DREAMER. The early phases of the pipeline are epoching, preprocessing, and feature extraction. We used different feature extraction methods in the time, frequency, and time-frequency domains. Then we applied feature and channel selection algorithms to extract the most reliable and relevant features and their sources (i.e., electrode channels). We used Laplacian scoring for temporal features and the proposed algorithm for subject-specific channel selection. We encapsulated these algorithms in the subject-independent emotion recognition framework, where we learn features in an unsupervised manner until target variables (i.e., subjective labeling) are available. We estimated the relevance of ranked features to two target emotions, valence and arousal, and built a nested cross-validated model for emotion recognition.

Temporal and spectral feature selection
Although there was some variation in features selected using the Laplacian score, mutual information estimation of features against the subjective labels minimized this variation across subjects. Fig 3 shows features selected more frequently in each stimulus. Only a third of the pool of features were selected in the benchmarks we used in this work.
Features that explain the spectral properties of the power spectrum, like the average power of theta, alpha, and beta, are of great importance for recognizing emotions; see Figs 3 and 4. Additionally, temporal features that infer information directly from the signal in the time domain were selected, such as statistical moments (i.e., mean and skewness), Hjorth parameters, and methods that measure regularity and wave unpredictability in time series (e.g., entropy).
Similar features have been reported in the literature on emotion recognition. The average power of the frequency bands theta, alpha, and beta, and the mean and skewness of the time-series signal were selected in [25] using the Fisher Criterion Score (FCS). The authors of [22] reported their analysis of selecting features based on Chi-square, mutual information, ANOVA F-value, and recursive feature elimination (RFE). They showed that beta and gamma bands are more correlated with the dependent emotion variable. Most of the recent literature on channel and feature selection has relied heavily on supervised techniques that would not advance online emotion recognition. The temporal and spatial feature selection algorithms we used and proposed in this work are independent of the subjective feedback, except for measuring the relevance of the few ranked channels and features to the target emotion.

Channel selection
We used the channel selection algorithm proposed in this work to rank electrode channels in order to reduce redundancy and to identify channels that exhibit consistent regional co-activation during the stimulus session. In Figs 5 and 6, we can see that between 4 and 10 channels are sufficiently competitive, based on the results of this study and compared to other studies. We noticed that the smaller the epoch size is, the more accurately we can classify emotions. This can be explained by the more detailed coverage of features extracted from smaller windows of time.
According to the emotion literature, the brain's frontal lobe mainly projects human emotions through neural activation. Most of the studies on public datasets included frontal lobe electrodes such as F3 and F4. With the feature selection methods mentioned in the previous section, electrode channels and their temporal signals are usually reduced with little consideration of the spatial distribution of channels. Electrode channels and their temporal signals are conventionally transformed into a feature vector before feature selection is applied. Here, we address the spatial distribution of the multidimensional feature space. We show that our unsupervised results align with the literature for the DEAP, MAHNOB, and DREAMER datasets: as shown in Fig 7, the frontal, temporal, and occipital lobes are frequently selected.

Performance comparison
We compare our results with works that studied emotion recognition using a subject-independent scheme, which is considerably more challenging than subject-dependent.

PLOS ONE
We used an SVM classifier to predict the emotional state of a novel subject given their EEG data. We used nested cross-validation and grid search to optimize the SVM kernels and their parameters mentioned in 3.7. We found that a regularization parameter of one (i.e., C = 1) generally gives higher accuracy across datasets. Linear SVMs were more stable; hence, the results are based on a linear SVM.
Our results outperformed existing subject-independent emotion recognition studies by about 16%-27%. We classified human affect in an online fashion with the following performance for (valence, arousal): DEAP (0.88, 0.86), MAHNOB-HCI (0.87, 0.82), and DREAMER (0.89, 0.85). Performance was measured using the F1 score due to the imbalanced classes. All studies we compare with, except [26], reported their results as aggregated values (e.g., grand mean); hence, detailed statistical analyses for all studies are hard to conduct. We used a one-sided Wilcoxon signed-rank test to compare our results against those of [26], given their leave-one-subject-out results on DEAP. The test indicated that the difference is statistically significant (p < .05), based on their mean valence result over all stimuli of 68 ± 34% compared with the 88 ± 7% reported here. We show our results compared to recently published works in Table 1.
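The reason for preferring the F1 score under class imbalance can be shown with a small worked example. The function below computes the binary F1 score from true positives, false positives, and false negatives; which class is treated as positive and whether scores are averaged over classes are assumptions of this sketch, and the labels are made-up toy values rather than results from this study.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Binary F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy imbalanced case: nine negatives, one positive, all predicted negative.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(accuracy(y_true, y_pred))   # high despite missing the positive class
print(f1_binary(y_true, y_pred))  # exposes the missed positive class
```

In this toy case the accuracy is 0.9 while the F1 score for the minority class is 0, which is why accuracy alone can be misleading when classes are imbalanced.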

Conclusion
We propose a subject-independent framework to recognize human affects by learning individual particularities in an unsupervised manner. That is, the most stimulus-subject-relevant EEG features and channels are selected to identify regions of the brain that activate due to a given stimulus. We then applied mutual information estimation to select features that are more relevant to the target emotion in order to level out inter-subject variation. Using nested cross-validation and an SVM classifier, we tested the proposed framework on real EEG data, namely DEAP, MAHNOB-HCI, and DREAMER. We were able to classify human affect in real time with F1 scores of 0.88, 0.87, and 0.89 for valence and 0.86, 0.82, and 0.85 for arousal in DEAP, MAHNOB-HCI, and DREAMER, respectively. This framework also proposes an online analysis solution for affect recognition. Our work achieves a performance increase of 16-27% for valence and arousal compared to recently published results of emotion recognition using the subject-independent scheme, which is considerably more challenging than subject-dependent schemes.